Introduction

The data for this project was provided by Starbucks for the Udacity Data Scientist Nanodegree program. It is simulated (rather than actual) data that mimics reward offers and customer purchasing behavior over a 30-day test period.

The main question I would like to answer with this project is:

  • What offer should I recommend based on an individual's demographic data?

In order to answer this question, I plan to create a machine learning model that predicts transaction amounts based on demographic data and reward type. Using this model, I can identify which offer maximizes a member's predicted transaction and then recommend making that offer to them.

Three datasets are provided:

portfolio.json

Offers sent during a 30-day test period

  • id (string) - offer id
  • offer_type (string) - type of offer ie BOGO, discount, informational
  • difficulty (int) - minimum required spend to complete an offer
  • reward (int) - reward given for completing an offer
  • duration (int) - time for offer to be open, in days
  • channels (list of strings)

profile.json

Rewards program users

  • age (int) - age of the customer
  • became_member_on (int) - date when customer created an app account
  • gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
  • id (str) - customer id
  • income (float) - customer's income

transcript.json

Event log

  • event (str) - record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) - customer id
  • time (int) - time in hours since start of test. The data begins at time t=0
  • value - (dict of strings) - either an offer id or transaction amount depending on the record
In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.legend_handler import HandlerTuple
from datetime import datetime
import progressbar
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import VotingRegressor

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [2]:
# read in datasets
portfolio = pd.read_json('data/portfolio.json',orient='records',lines=True)
profile = pd.read_json('data/profile.json',orient='records',lines=True)
transcript = pd.read_json('data/transcript.json',orient='records',lines=True)
In [3]:
# Preview each dataframe and display the shape
display(portfolio.head())
print('portfolio shape:',portfolio.shape)
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
portfolio shape: (10, 6)
In [4]:
display(profile.head())
print('profile shape:',profile.shape)
gender age id became_member_on income
0 None 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 None 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 None 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN
profile shape: (17000, 5)
In [5]:
display(transcript.head())
print('transcript shape:',transcript.shape)
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0
transcript shape: (306534, 4)

Data Exploration

Exploring the portfolio dataframe is easy because there are only 10 rows and 6 columns so the entire dataframe can be easily visualized.

In [6]:
# display the entire portfolio dataframe
portfolio
Out[6]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
5 3 [web, email, mobile, social] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2
6 2 [web, email, mobile, social] 10 10 discount fafdcd668e3743c1bb461111dcafc2a4
7 0 [email, mobile, social] 0 3 informational 5a8bc65990b245e5a138643cd4eb9837
8 5 [web, email, mobile, social] 5 5 bogo f19421c1d4aa40978ebb69ca19b0e20d
9 2 [web, email, mobile] 10 7 discount 2906b810c7d4411798c6938adc9daaa5

There are 10 different offers: 4 bogo's, 4 discounts, and 2 informational types. All of them use email as a channel, which means that including email doesn't provide a way to distinguish any of the offer types. I can drop this value from the 'channels' feature to simplify the analysis.

For the profile dataframe, lets first look at null values. Missing values in 'age' are represented by the value 118, while in 'gender' they are represented by the string 'None'. I will map these values to NaN's and then count the null values in each column.

In [7]:
# Map 118 to NaN in 'Age' column
profile['age'] = profile['age'].apply(lambda x: np.nan if x==118 else x)
In [8]:
profile.head()
Out[8]:
gender age id became_member_on income
0 None NaN 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55.0 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 None NaN 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75.0 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 None NaN a03223e636434f42ac4c3df47e8bac43 20170804 NaN
In [9]:
# Map 'None' to NaN in 'gender column'
profile['gender'] = profile['gender'].apply(lambda x: np.nan if x==None else x)
profile.head()
Out[9]:
gender age id became_member_on income
0 NaN NaN 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55.0 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 NaN NaN 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75.0 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 NaN NaN a03223e636434f42ac4c3df47e8bac43 20170804 NaN
In [10]:
# count the number of null values in each column of the profile dataframe
profile.isnull().sum()
Out[10]:
gender              2175
age                 2175
id                     0
became_member_on       0
income              2175
dtype: int64
In [11]:
# now count the number of rows with 3 null values
profile.isnull().sum(axis=1).value_counts()
Out[11]:
0    14825
3     2175
dtype: int64

Of the 17,000 users in the profile dataset, 2,175 are missing age, gender, and income data while the remaining 14,825 users are not missing any demographic data. The only identifying feature for the 2,175 users is the date they joined the Starbucks rewards program. This could be a useful feature later in the analysis, but if it isn't, then I'll feel confident dropping the users with missing values since I'll still have a majority of the dataset with complete information.

I also want to visualize the demographic data and calculate some descriptive statistics for each demographic feature.

In [12]:
# plot a histogram of the 'age' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(profile['age'],bins=range(10,120,10))
ax.set_xlabel('Age (years)',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Starbucks Rewards Member Ages',fontsize=16)
plt.show()
In [13]:
# plot a histogram of the 'became_member_on' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(profile['became_member_on'],bins=range(20130000,20200000,10000))
ax.set_xticks(range(20130000,20200000,10000))
ax.set_xlabel('Enrollment year',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Starbucks Rewards Member Enrollment',fontsize=16)
plt.show()
In [14]:
# plot a pie chart of the 'gender' feature
gender_labels = profile['gender'].value_counts().index.map({'M':'Male','F':'Female','O':'Other'})
fig, ax = plt.subplots(figsize=(16,9))
ax.pie(profile['gender'].value_counts(),labels=gender_labels,autopct='%1.f%%')
ax.set_title('Breakdown of Starbucks Rewards Member Genders',fontsize=16)
plt.show()
In [15]:
# plot a histogram of the 'income' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(profile['income'],bins=range(0,140000,20000))
ax.set_xlabel('Income',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Starbucks Rewards Member Income',fontsize=16)
plt.show()
In [16]:
# Generate descriptive statistics of the demographic data
profile.describe()
Out[16]:
age became_member_on income
count 14825.000000 1.700000e+04 14825.000000
mean 54.393524 2.016703e+07 65404.991568
std 17.383705 1.167750e+04 21598.299410
min 18.000000 2.013073e+07 30000.000000
25% 42.000000 2.016053e+07 49000.000000
50% 55.000000 2.017080e+07 64000.000000
75% 66.000000 2.017123e+07 80000.000000
max 101.000000 2.018073e+07 120000.000000

Visualizations and descriptive statistics of the demographic data in the profile dataset have given me a better understanding of the makeup of this user group. The rewards members range in age from 18 to 101 years with a mean age of ~54 years. The minimum age makes sense since it is likely an individual must be 18 years or older to join the rewards program, but I am a little surprised that there is a 101-year-old member and that the age distribution is centered around 55. I expected the distribution to be centered around users in their 20s and 30s. Member incomes range from \$14,825 to \\$120,000 with a mean of ~$65,400, which falls in line with my expectations for income distribution in the U.S. A majority (57\%) of members are male, which also surprised me since I expected it to be closer to a 50/50 split with female or skewed towards female consumers. Finally, membership enrollment ranges from July 2013 to July 2018 with enrollment numbers increasing at an increasing rate from 2013 through 2017 before decreasing in 2018. The decrease in 2018 is likely due to capturing an incomplete year of member enrollments. This trend in enrollments indicates that the Starbucks Rewards program is increasing in popularity and therefore there is an opportunity to influence the purchasing behavior of a growing number of Starbucks customers with this project.

For the transcript dataset, I don't expect there to be any missing data since these records were generated during the experiment and should have been set up to record all relevant information.

In [17]:
transcript.isnull().sum()
Out[17]:
person    0
event     0
value     0
time      0
dtype: int64

As expected, there are no missing values in the transcript dataset. To better understand the data in this dataset, I will have to use different exploration techniques than those used on the profile dataset since most of those features were continuous numerical or categorical types.

In [18]:
# plot a pie chart of the 'event' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.pie(transcript['event'].value_counts(),labels=transcript['event'].value_counts().index,autopct='%1.f%%')
ax.set_title('Breakdown of Event Types')
plt.show()
In [19]:
# determine the number of unique members in the transcript dataframe
n_members = len(np.unique(transcript['person']))
print('{} members had an event captured in the transcript dataset'.format(n_members))
17000 members had an event captured in the transcript dataset
In [20]:
# group the transcript dataframe by 'person'
transcript_grouped = transcript.groupby('person')
In [21]:
# plot distribution of events per person
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(transcript_grouped['event'].count(),bins=range(0,55,5))
ax.set_xlabel('Events per person',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Events per Person',fontsize=16)
plt.show()
In [22]:
# generate descriptive statistics on the events per person
transcript_grouped['event'].count().describe()
Out[22]:
count    17000.000000
mean        18.031412
std          6.849595
min          1.000000
25%         13.000000
50%         17.000000
75%         23.000000
max         51.000000
Name: event, dtype: float64
In [23]:
# confirm that all of the customers in the 'transcript' dataframe are also in the 'profile' dataframe
print('Number of users that are in the transcript dataset but not the profile dataset:',len(np.setdiff1d(np.unique(transcript['person']),profile['id'])))
Number of users that are in the transcript dataset but not the profile dataset: 0
In [24]:
# plot distribution of events over time
fig, ax = plt.subplots(figsize=(16,9))
ax.scatter(transcript.index,transcript['time'])
ax.set_xlabel('Event Number',fontsize=12)
ax.set_ylabel('Time since start of experiment (hour)',fontsize=12)
ax.set_title('Distribution of Events over Time',fontsize=16)
plt.show()

The transcript dataset is broken up fairly evenly between transactions (45%) and reward-based events consisting of offers received (25%), viewed (19%), and completed (11%). Based on the percentages, I noticed that approximately 80% of the rewards that were received were also viewed (19%/25%), and less than half (11%/25%) were completed. Each user had at least 1 event, and all of the users recorded in the transcript dataset are also in the profile dataset. Finally, the distribution of events over time shows a large number of events occurring in a single hour periodically throughout the 30-day experimental period. These periods of activity probably correspond with the distribution new reward offers, which can happen simultaneously if they are generated automatically. The way these reward offers are distributed in groups may serve to break up the experiment into multiple segments. This raises the question: are new offers only sent out after the previous offers expired? Or is there overlap between offers for a given customer? I will investigate this.

Data Cleaning

Now that I better understand the data that I am working with, I will clean it to make it easier to work with. This includes one hot encoding categorical features, extracting data from lists or dictionaries within the datasets, converting strings to dates and times, and remapping reward and member ids to something that is easier to comprehend.

In [25]:
# view the portfolio dataset
portfolio
Out[25]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
5 3 [web, email, mobile, social] 7 7 discount 2298d6c36e964ae4a3e7e9706d1fb8c2
6 2 [web, email, mobile, social] 10 10 discount fafdcd668e3743c1bb461111dcafc2a4
7 0 [email, mobile, social] 0 3 informational 5a8bc65990b245e5a138643cd4eb9837
8 5 [web, email, mobile, social] 5 5 bogo f19421c1d4aa40978ebb69ca19b0e20d
9 2 [web, email, mobile] 10 7 discount 2906b810c7d4411798c6938adc9daaa5
In [26]:
# one hot encode the 'offer_type' feature of the 'portfolio' dataset
portfolio = pd.get_dummies(portfolio,prefix='',prefix_sep='',columns=['offer_type'])
portfolio
Out[26]:
reward channels difficulty duration id bogo discount informational
0 10 [email, mobile, social] 10 7 ae264e3637204a6fb9bb56bc8210ddfd 1 0 0
1 10 [web, email, mobile, social] 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 1 0 0
2 0 [web, email, mobile] 0 4 3f207df678b143eea3cee63160fa8bed 0 0 1
3 5 [web, email, mobile] 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 0 0
4 5 [web, email] 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 0 1 0
5 3 [web, email, mobile, social] 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 0 1 0
6 2 [web, email, mobile, social] 10 10 fafdcd668e3743c1bb461111dcafc2a4 0 1 0
7 0 [email, mobile, social] 0 3 5a8bc65990b245e5a138643cd4eb9837 0 0 1
8 5 [web, email, mobile, social] 5 5 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0
9 2 [web, email, mobile] 10 7 2906b810c7d4411798c6938adc9daaa5 0 1 0

To one hot encode the 'channels' feature, I can't use get_dummies because there are multiple entries in each row. I will have to iterate through each row and manually assign values instead.

In [27]:
# create a list of all options for 'channels' (easy to do visually/manually in a small df)
channel_list = ['web','email','mobile','social']
In [28]:
# write a function to encode the channels in 'channels' into separate columns
def encode_column(df,column,col_list):
    '''
    One hot encodes a column of a dataframe when multiple entries occur in a single column
    Drops the original column once encoding is done
    
    Inputs
    df (pandas dataframe): the dataframe on which to perform the encoding
    column (string): name of the column to encode
    col_list (list): all values which can be in 'column'
    
    Returns
    df (pandas dataframe): the original df with encoded columns and without the original column
    '''
    for col in col_list: # loop through the list of values in column
        df.loc[:,col] = 0 # initialize each new column with 0's
    # create progress bar to track progress
    bar = progressbar.ProgressBar(maxval=len(df.index),widgets=[progressbar.Bar('=','[',']',' '),' ',progressbar.Percentage()])
    bar.start()
    cter = 1
    for idx in df.index: # loop through the indices of the df
        for value in df.loc[idx,column]:
            df.loc[idx,value] = 1 # assign a value of 1 to the corresponding index,column that corresponds to 'value'
        bar.update(cter) # update progress bar
        cter += 1 # update counter
    bar.finish()
    
    # drop the original column
    df = df.drop(columns=column)
    return df
In [29]:
# run the 'encode_column' function on the 'channels' column in 'portfolio' df and assign to 'portfolio_encoded'
portfolio_encoded = encode_column(portfolio,'channels',channel_list)

# view new 'portfolio_encoded' df
portfolio_encoded
[========================================================================] 100%
Out[29]:
reward difficulty duration id bogo discount informational web email mobile social
0 10 10 7 ae264e3637204a6fb9bb56bc8210ddfd 1 0 0 0 1 1 1
1 10 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 1 0 0 1 1 1 1
2 0 0 4 3f207df678b143eea3cee63160fa8bed 0 0 1 1 1 1 0
3 5 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 0 0 1 1 1 0
4 5 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 0 1 0 1 1 0 0
5 3 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 0 1 0 1 1 1 1
6 2 10 10 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 1 1 1 1
7 0 0 3 5a8bc65990b245e5a138643cd4eb9837 0 0 1 0 1 1 1
8 5 5 5 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 1 1 1 1
9 2 10 7 2906b810c7d4411798c6938adc9daaa5 0 1 0 1 1 1 0
In [30]:
# since every reward can be delivered via email, this does not add any information to the analysis
# I will drop the 'email' column
portfolio_encoded.drop(columns='email',inplace=True)

The final bit of cleaning I will perform on the portfolio dataset is more cosmetic than functional. I will replace the 32 character alphanumeric 'id' with an integer value and group them by reward type. This will make it easier for me to quickly refer to the reward ids later in the analysis.

In [31]:
# sort the 'portfolio_encoded' df so that it is grouped by reward type, then increasing diffulty and duration
portfolio_encoded.sort_values(by=['bogo','discount','informational','difficulty','duration'],
                              ascending=[False,False,False,True,True],inplace=True)
portfolio_encoded
Out[31]:
reward difficulty duration id bogo discount informational web mobile social
8 5 5 5 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 1 1 1
3 5 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 0 0 1 1 0
1 10 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 1 0 0 1 1 1
0 10 10 7 ae264e3637204a6fb9bb56bc8210ddfd 1 0 0 0 1 1
5 3 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 0 1 0 1 1 1
9 2 10 7 2906b810c7d4411798c6938adc9daaa5 0 1 0 1 1 0
6 2 10 10 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 1 1 1
4 5 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 0 1 0 1 0 0
7 0 0 3 5a8bc65990b245e5a138643cd4eb9837 0 0 1 0 1 1
2 0 0 4 3f207df678b143eea3cee63160fa8bed 0 0 1 1 1 0
In [32]:
def id_mapper(df,column):
    '''
    Create a dictionary with keys corresponding to original ids and values corresponding to new integer ids
    
    Inputs
    df (pandas dataframe): the dataframe containing ids to remap
    column (string): the name of the id column
    
    Returns
    id_dict (dict): dictionary with 
    '''
    id_dict = {}
    counter = 1 # initialize counter for reward_ids
    for _id in df[column]:
        if _id not in id_dict.keys():
            id_dict[_id] = counter
            counter+=1
    return id_dict
In [33]:
# create the dictionary for reward ids
reward_id_dict = id_mapper(portfolio_encoded,'id')

# view the reward_id dictionary
reward_id_dict
Out[33]:
{'f19421c1d4aa40978ebb69ca19b0e20d': 1,
 '9b98b8c7a33c4b65b9aebfe6a799e6d9': 2,
 '4d5c57ea9a6940dd891ad53e9dbe8da0': 3,
 'ae264e3637204a6fb9bb56bc8210ddfd': 4,
 '2298d6c36e964ae4a3e7e9706d1fb8c2': 5,
 '2906b810c7d4411798c6938adc9daaa5': 6,
 'fafdcd668e3743c1bb461111dcafc2a4': 7,
 '0b1e1539f2cc45b7b9fa7c272da2e1d7': 8,
 '5a8bc65990b245e5a138643cd4eb9837': 9,
 '3f207df678b143eea3cee63160fa8bed': 10}
In [34]:
def map_id_column(df,old_col,new_col,id_dict):
    '''
    Creates a new column in a dataframe with values based on the values in an existing column and a mapping dictonary
    Drops the existing column
    
    Inputs
    df (pandas dataframe): the dataframe containing the column to map
    old_col (string): the name of the existing column
    new_col (string): the name of the new column to create
    id_dict (dict): dictionary with old_col entries as keys and new_col entries as values
    
    Returns
    df (pandas dataframe): the original df with the new column added and the old column dropped
    '''
    df[new_col] = df[old_col].apply(lambda x: id_dict.get(x,0)) # create new column with values mapped from the old column
    df.drop(columns=old_col,inplace=True) # drop the old column
    
    return df
In [35]:
portfolio_encoded = map_id_column(portfolio_encoded,'id','reward_id',reward_id_dict)
portfolio_encoded
Out[35]:
reward difficulty duration bogo discount informational web mobile social reward_id
8 5 5 5 1 0 0 1 1 1 1
3 5 5 7 1 0 0 1 1 0 2
1 10 10 5 1 0 0 1 1 1 3
0 10 10 7 1 0 0 0 1 1 4
5 3 7 7 0 1 0 1 1 1 5
9 2 10 7 0 1 0 1 1 0 6
6 2 10 10 0 1 0 1 1 1 7
4 5 20 10 0 1 0 1 0 0 8
7 0 0 3 0 0 1 0 1 1 9
2 0 0 4 0 0 1 1 1 0 10
In [473]:
# save portfolio_encoded in pickle file
pickle.dump(portfolio_encoded,open('pickle_files/portfolio_encoded.p','wb'))
In [474]:
# load portfolio_encoded from pickle file
portfolio_encoded = pickle.load(open('pickle_files/portfolio_encoded.p','rb'))
In [36]:
# create an inverse dictionary mapper in case I want to access the original ids
def id_mapper_inverse(id_dict):
    '''
    Creates an inverse dictionary where the keys become the values and the values become the keys
    
    Input
    id_dict (dict): the original dictionary to invert
    
    Returns
    id_dict_inv (dict): the inverted dictionary
    '''
    id_dict_inv = {} # initialize dictionary
    for key, value in id_dict.items(): # loop through the key-value pairs
        id_dict_inv[value] = key # assign the value as the key and the key as the value of the inverse dictionary
    
    return id_dict_inv
In [37]:
# test the id_mapper_inv function on the reward_id_dict
reward_id_dict_inv = id_mapper_inverse(reward_id_dict)
reward_id_dict_inv
Out[37]:
{1: 'f19421c1d4aa40978ebb69ca19b0e20d',
 2: '9b98b8c7a33c4b65b9aebfe6a799e6d9',
 3: '4d5c57ea9a6940dd891ad53e9dbe8da0',
 4: 'ae264e3637204a6fb9bb56bc8210ddfd',
 5: '2298d6c36e964ae4a3e7e9706d1fb8c2',
 6: '2906b810c7d4411798c6938adc9daaa5',
 7: 'fafdcd668e3743c1bb461111dcafc2a4',
 8: '0b1e1539f2cc45b7b9fa7c272da2e1d7',
 9: '5a8bc65990b245e5a138643cd4eb9837',
 10: '3f207df678b143eea3cee63160fa8bed'}

The portfolio dataset now consists of only numeric values. The reward_ids are grouped in a way the I can remember and quickly refer to later:

  • BOGO offers are ids 1-4 with increasing difficulty and duration
  • Discount offers are ids 5-8 with increasing difficulty and duration
  • Informational offers are ids 9-10 with increasing duration

I have already handled the missing values in the profile dataset. Now I will convert the enrollment date ('became_member_on') string to a date object and a timestamp and map the alphanumeric 'id' to a numeric user_id.

In [38]:
# convert 'became_member_on' string to date object
profile['enrollment_date'] = profile['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d').date())
profile['enrollment_tstamp'] = profile['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d').timestamp())
profile.head()
Out[38]:
gender age id became_member_on income enrollment_date enrollment_tstamp
0 NaN NaN 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN 2017-02-12 1.486879e+09
1 F 55.0 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0 2017-07-15 1.500095e+09
2 NaN NaN 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN 2018-07-12 1.531372e+09
3 F 75.0 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0 2017-05-09 1.494306e+09
4 NaN NaN a03223e636434f42ac4c3df47e8bac43 20170804 NaN 2017-08-04 1.501823e+09
In [39]:
# one hot encode 'gender' column
profile_encoded = pd.get_dummies(profile,columns=['gender'])
# drop the 'became_member_on' column
profile_encoded.drop(columns='became_member_on',inplace=True)
# preview df
profile_encoded.head()
Out[39]:
age id income enrollment_date enrollment_tstamp gender_F gender_M gender_O
0 NaN 68be06ca386d4c31939f3a4f0e3dd783 NaN 2017-02-12 1.486879e+09 0 0 0
1 55.0 0610b486422d4921ae7d2bf64640c50b 112000.0 2017-07-15 1.500095e+09 1 0 0
2 NaN 38fe809add3b4fcf9315a9694bb96ff5 NaN 2018-07-12 1.531372e+09 0 0 0
3 75.0 78afa995795e4d85b5d9ceeca43f5fef 100000.0 2017-05-09 1.494306e+09 1 0 0
4 NaN a03223e636434f42ac4c3df47e8bac43 NaN 2017-08-04 1.501823e+09 0 0 0
In [40]:
# sort the profile dataset by enrollment date then use the id_mapper function
member_id_dict = id_mapper(profile_encoded.sort_values(by=['enrollment_date','age']),'id')

profile_encoded = map_id_column(profile_encoded,'id','member_id',member_id_dict)
profile_encoded.head()
Out[40]:
age income enrollment_date enrollment_tstamp gender_F gender_M gender_O member_id
0 NaN NaN 2017-02-12 1.486879e+09 0 0 0 6730
1 55.0 112000.0 2017-07-15 1.500095e+09 1 0 0 8198
2 NaN NaN 2018-07-12 1.531372e+09 0 0 0 16717
3 75.0 100000.0 2017-05-09 1.494306e+09 1 0 0 7566
4 NaN NaN 2017-08-04 1.501823e+09 0 0 0 8579
In [41]:
# create an inverse member_id dictionary
member_id_dict_inv = id_mapper_inverse(member_id_dict)

For the transcript dataset, I need to one hot encode the 'event' column, map the 'person' column to the numeric 'member_id' using the member_id dictionary, and extract the values from the dictionaries in the 'value' column.

In [42]:
# replace spaces with underscores in 'event' column
transcript['event'] = transcript['event'].apply(lambda x: x.replace(' ','_'))
# one hot encode the 'event' column
transcript_encoded = pd.get_dummies(transcript,prefix='',prefix_sep='',columns=['event'])

# preview the new df
transcript_encoded.head()
Out[42]:
person value time offer_completed offer_received offer_viewed transaction
0 78afa995795e4d85b5d9ceeca43f5fef {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0 0 1 0 0
1 a03223e636434f42ac4c3df47e8bac43 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0 0 1 0 0
2 e2127556f4f64592b11af22de27a7932 {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0 0 1 0 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0 0 1 0 0
4 68617ca6246f4fbc85e91a2a49552598 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0 0 1 0 0
In [43]:
# map the alphanumeric 'person' column to 'member_id'
transcript_encoded = map_id_column(transcript_encoded,'person','member_id',member_id_dict)
transcript_encoded.head()
Out[43]:
value time offer_completed offer_received offer_viewed transaction member_id
0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0 0 1 0 0 7566
1 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0 0 1 0 0 8579
2 {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0 0 1 0 0 15235
3 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0 0 1 0 0 10036
4 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0 0 1 0 0 10235
In [44]:
# identify all possible options for keys in the 'value' column
value_set = set() # initialize an empty set
for row in transcript_encoded['value']: # loop through the 'value' column
    for key in row.keys():
        value_set.add(key) # add the key from each dictionary to the set

# view the set
value_set
Out[44]:
{'amount', 'offer id', 'offer_id', 'reward'}
In [45]:
# create new columns for the keys in the 'value' column
for col in list(value_set):
    transcript_encoded[col] = 0 # initialize them with 0's
    
transcript_encoded.head()
Out[45]:
value time offer_completed offer_received offer_viewed transaction member_id offer id amount offer_id reward
0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0 0 1 0 0 7566 0 0 0 0
1 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0 0 1 0 0 8579 0 0 0 0
2 {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0 0 1 0 0 15235 0 0 0 0
3 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0 0 1 0 0 10036 0 0 0 0
4 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0 0 1 0 0 10235 0 0 0 0
In [46]:
# # loop through the indices and assign the values in the 'value' dictionaries to the columns matching the keys
# # use a progress bar to track progress (takes a while)
# bar = progressbar.ProgressBar(maxval=transcript_encoded.index[-1],widgets=[progressbar.Bar('=','[',']',' '),' ',progressbar.Percentage()])
# bar.start()
# for idx in transcript_encoded.index:
#     value_dict = transcript_encoded.loc[idx,'value']
#     for key, value in value_dict.items():
#         transcript_encoded.loc[idx,key] = value
#     bar.update(idx)
    
# bar.finish()
In [47]:
# # save the dataframe in a pickle file to skip the previous step in future runs
# pickle.dump(transcript_encoded,open('pickle_files/transcript_encoded.p','wb'))
In [48]:
# load the pickle file
transcript_encoded = pickle.load(open('pickle_files/transcript_encoded.p','rb'))

# preview the dataframe loaded from the pickle file
transcript_encoded.head()
Out[48]:
time value offer_completed offer_received offer_viewed transaction member_id amount offer_id reward offer id
0 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0 1 0 0 7566 0.0 0 0 9b98b8c7a33c4b65b9aebfe6a799e6d9
1 0 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0 1 0 0 8579 0.0 0 0 0b1e1539f2cc45b7b9fa7c272da2e1d7
2 0 {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0 1 0 0 15235 0.0 0 0 2906b810c7d4411798c6938adc9daaa5
3 0 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0 1 0 0 10036 0.0 0 0 fafdcd668e3743c1bb461111dcafc2a4
4 0 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0 1 0 0 10235 0.0 0 0 4d5c57ea9a6940dd891ad53e9dbe8da0
In [49]:
# count the number of rows where each of the 'value' features has been populated
(transcript_encoded.iloc[:,-4:] != 0).sum()
Out[49]:
amount      138953
offer_id     33579
reward       33579
offer id    134002
dtype: int64
In [50]:
# check if there are any rows where both 'offer id' and 'offer_id' are populated
print('The number of rows where both "offer id" and "offer_id" are populated:',
      ((transcript_encoded.loc[:,'offer id'] !=0) & (transcript_encoded.loc[:,'offer_id'] !=0)).sum())
The number of rows where both "offer id" and "offer_id" are populated: 0
In [51]:
# since there are no rows where both 'offer id' and 'offer_id' are populated, I can combine them into one column
transcript_encoded['offer_id'] = transcript_encoded[['offer id','offer_id']].apply(lambda x: x['offer id'] if x['offer id'] != 0 else x['offer_id'],axis=1)
# drop the 'offer id' column and 'value' columns
transcript_encoded.drop(columns=['value','offer id'],inplace=True)

# preview the dataframe
transcript_encoded.head()
Out[51]:
time offer_completed offer_received offer_viewed transaction member_id amount offer_id reward
0 0 0 1 0 0 7566 0.0 9b98b8c7a33c4b65b9aebfe6a799e6d9 0
1 0 0 1 0 0 8579 0.0 0b1e1539f2cc45b7b9fa7c272da2e1d7 0
2 0 0 1 0 0 15235 0.0 2906b810c7d4411798c6938adc9daaa5 0
3 0 0 1 0 0 10036 0.0 fafdcd668e3743c1bb461111dcafc2a4 0
4 0 0 1 0 0 10235 0.0 4d5c57ea9a6940dd891ad53e9dbe8da0 0
In [52]:
# map the offer_ids to reward_ids
transcript_encoded = map_id_column(transcript_encoded, 'offer_id','reward_id',reward_id_dict)

# preview the dataframe
transcript_encoded.head()
Out[52]:
time offer_completed offer_received offer_viewed transaction member_id amount reward reward_id
0 0 0 1 0 0 7566 0.0 0 2
1 0 0 1 0 0 8579 0.0 0 8
2 0 0 1 0 0 15235 0.0 0 6
3 0 0 1 0 0 10036 0.0 0 7
4 0 0 1 0 0 10235 0.0 0 3

Feature Extraction

Now that the transcript dataset has been cleaned and only numeric values remain, I will transform the dataset to group transactions by member_id and determine if transactions occured while an offer was active or not. There are three criteria to determine if an offer is active at a given time: 1) the offer was viewed, 2) the offer has not expired, and 3) the offer has not been completed (applies to BOGO offers).

In [53]:
def get_member_transcript(member_id, df=transcript_encoded):
    '''
    Get all actions for a given member
    
    Input
    member_id (int): the numeric member_id for the user whose transcript you want to view
    df (dataframe): the dataframe from which to extract the appropriate rows
    
    Returns
    transcript (dataframe): All actions for that member in the dataset, sorted by time
    '''
    transcript = df.loc[df['member_id']==member_id,:]
    transcript = transcript.sort_values(by='time')
    
    return transcript
In [54]:
# test get_member_transcript function
get_member_transcript(7566)
Out[54]:
time offer_completed offer_received offer_viewed transaction member_id amount reward reward_id
0 0 0 1 0 0 7566 0.00 0 2
15561 6 0 0 1 0 7566 0.00 0 2
47582 132 0 0 0 1 7566 19.89 0 0
47583 132 1 0 0 0 7566 0.00 5 2
49502 144 0 0 0 1 7566 17.78 0 0
53176 168 0 1 0 0 7566 0.00 0 9
85291 216 0 0 1 0 7566 0.00 0 9
87134 222 0 0 0 1 7566 19.67 0 0
92104 240 0 0 0 1 7566 29.72 0 0
141566 378 0 0 0 1 7566 23.93 0 0
150598 408 0 1 0 0 7566 0.00 0 4
163375 408 0 0 1 0 7566 0.00 0 4
201572 504 0 1 0 0 7566 0.00 0 1
218393 510 0 0 0 1 7566 21.72 0 0
218394 510 1 0 0 0 7566 0.00 10 4
218395 510 1 0 0 0 7566 0.00 5 1
230412 534 0 0 0 1 7566 26.56 0 0
262138 582 0 0 1 0 7566 0.00 0 1
In [55]:
def get_member_transactions(member_id, df=transcript_encoded):
    '''
    Get all transactions for a given member
    
    Input
    member_id (int): the numeric member_id for the user whose transactions you want to view
    df (dataframe): the dataframe from which to extract the appropriate rows
    
    Returns
    transactions (dataframe): All transactions for that member in the dataset, sorted by time
    '''
    transactions = df.loc[(df['member_id']==member_id)&(df['transaction']==1),:]
    transactions = transactions.sort_values(by='time')
    
    return transactions
In [56]:
# test get_member_transactions function
get_member_transactions(7566)
Out[56]:
time offer_completed offer_received offer_viewed transaction member_id amount reward reward_id
47582 132 0 0 0 1 7566 19.89 0 0
49502 144 0 0 0 1 7566 17.78 0 0
87134 222 0 0 0 1 7566 19.67 0 0
92104 240 0 0 0 1 7566 29.72 0 0
141566 378 0 0 0 1 7566 23.93 0 0
218393 510 0 0 0 1 7566 21.72 0 0
230412 534 0 0 0 1 7566 26.56 0 0
In [57]:
def create_active_offer_dict(df):
    '''
    Creates a dictionary with keys corresponding to member_id's and values are tuples containing the following:
    - time offer was viewed
    - time offer expired
    - reward_id
    - time offer was completed (optional)
    
    Input
    df (pandas dataframe): should contain time, action (offer received, viewed, completed, transaction), and reward_id
    
    Returns
    active_offer_dict: the dictionary described above
    '''
    # merge df with the portfolio_encoded df to add 'duration' column
    df = df.merge(portfolio_encoded,how='left',on='reward_id',suffixes=('_trans','_offer'))
    # initialize dictionary
    active_offer_dict = {}
    
    for member in np.unique(df['member_id']): # loop through all member ids in df
        # create dataframe with only transactions for one member
        member_trans_df = get_member_transactions(member, df)
        # create dataframe with only rows relating to offers received or viewed
        offer_rec_view_df = member_trans_df.loc[(member_trans_df['offer_received']==1) | (member_trans_df['offer_viewed'] == 1),:]
        if offer_rec_view_df.empty:
            continue # proceed to the next member if there are no offers received
        # create dataframe with only rows relating to offers completed
        completions_df = member_trans_df.loc[member_trans_df['offer_completed']==1,:]
        # shift 'time' values up one row in offer_rec_view_df to align the time an offer was viewed
        # with the offer that was received. store in an array
        time_viewed = offer_rec_view_df.groupby('reward_id')['time'].shift(-1)
        # calculate the times that offers expire by adding the offer duration (converted from days to hours)
        # to the time the offer was received. store in an array
        time_expired = offer_rec_view_df.apply(lambda x: x['time']+x['duration']*24 if x['offer_received']==1 else np.nan,axis=1)
        # create list of reward_ids for the offers that were received
        reward_id_list = offer_rec_view_df['reward_id']
        # combine the time_viewed, time_expired, and reward_id_list arrays into a df (to easily remove NaN's)
        offer_window_df = pd.DataFrame({'time_viewed':time_viewed,'time_expired':time_expired,'reward_id':reward_id_list})
        # drop missing values
        offer_window_df.dropna(how='any',inplace=True)
        # create list of tuples with format (time_viewed, time_expired, reward_id)
        offer_window_list = list(zip(offer_window_df['time_viewed'],offer_window_df['time_expired'],offer_window_df['reward_id']))
        
        # add the time an event was completed to the appropriate tuple in offer_window_list
        for row in completions_df.index:
            time = completions_df.loc[row,'time'] # extract the time the offer was completed
            reward_id = completions_df.loc[row,'reward_id'] # extract the associated reward_id 
            for idx in range(len(offer_window_list)): # loop through all active offer tuples
                start = offer_window_list[idx][0]
                end = offer_window_list[idx][1]
                reward = offer_window_list[idx][2]
                if (time >= start) & (time <= end) & (reward_id == reward):
                    offer_window_list[idx] += (time,) # add the offer completion time to the matching active offer
        # update the dictionary with the member_id as the new key and the list of active offer tuples as the value
        active_offer_dict[member] = offer_window_list
    return active_offer_dict
In [58]:
active_offer_dict = create_active_offer_dict(transcript_encoded)
In [59]:
def get_active_offers(transaction_time, member_id):
    '''
    Identify the active offer(s) (if any) for a given transaction
    
    Input
    transaction_time (int): the hour when the transaction occurred
    
    Returns
    offers_active (list): list of the reward_ids of the active offers or 0 if no offers are active
    '''
    offers_active = [] # intialize empty list
    offer_list = active_offer_dict.get(member_id,0) # get the list of offers for that member or return 0
    if offer_list == 0: # if 0 is returned, member has no offers, so return [0]
        return [0]
    else:
        for offer in offer_list: # loop through all offers for the given member
            start = offer[0] # extract time_viewed as start of active window
            if len(offer) == 4: # if offer was completed
                end = offer[3] # extract time completed as end of active window
            else:
                end = offer[1] # or extract expiration time as end of active window
            reward = offer[2] # extract reward_id
            if (transaction_time >= start) & (transaction_time <= end):
                # append reward to the offers_active list if the transaction time is in the active offer window
                offers_active.append(reward)
        if len(offers_active) == 0:
            offers_active.append(0) # append 0 if there are no active offers (rather than return an empty list)
        return offers_active
In [62]:
test_transactions = get_member_transactions(7566)
In [63]:
offers_active = []
for time, member in zip(test_transactions['time'],test_transactions['member_id']):
    offers_active.append(get_active_offers(time,member))
offers_active
Out[63]:
[[0], [0], [0], [0], [0], [0], [0]]
In [64]:
def insert_active_offers(df, col_num):
    '''
    Inserts the list of active offer lists into the df at the specified location with the column name 'active_offers'
    
    Inputs
    df (dataframe): the dataframe to insert active offers into, likely a df of transactions
    col_num (int): the location at which to insert the column
    
    Returns
    df (dataframe): the input df with the inserted 'active_offers' column
    '''
    active_offers = [] # initialize list
    for time, member in zip(df['time'],df['member_id']):
        active_offers.append(get_active_offers(time,member))
        
    df.insert(col_num,'active_offers',active_offers)
    
    return df
In [65]:
# Create transactions df and run insert_active_offers function
transactions = transcript_encoded.loc[transcript_encoded['transaction']==1,:]
transactions = insert_active_offers(transactions,7)

# preview transactions df
transactions.head()
Out[65]:
time offer_completed offer_received offer_viewed transaction member_id amount active_offers reward reward_id
12654 0 0 0 0 1 4694 0.83 [0] 0 0
12657 0 0 0 0 1 2971 34.56 [0] 0 0
12659 0 0 0 0 1 12705 13.23 [0] 0 0
12670 0 0 0 0 1 10639 19.51 [0] 0 0
12671 0 0 0 0 1 12395 18.97 [0] 0 0
In [280]:
# # run encode_column function on 'active_offers' to one hot encode (takes a while)
# transactions_encoded = encode_column(transactions,'active_offers',range(11))

# # save to a pickle file to avoid step in future
# pickle.dump(transactions_encoded,open('pickle_files/transactions_encoded.p','wb'))

# # preview transactions_encoded df
# transactions_encoded.head()
[========================================================================] 100%
Out[280]:
time offer_completed offer_received offer_viewed transaction member_id amount reward reward_id 0 1 2 3 4 5 6 7 8 9 10
12654 0 0 0 0 1 4694 0.83 0 0 0 0 0 0 1 0 0 0 0 0 0
12657 0 0 0 0 1 2971 34.56 0 0 0 0 0 0 0 0 1 0 0 0 0
12659 0 0 0 0 1 12705 13.23 0 0 1 0 0 0 0 0 0 0 0 0 0
12670 0 0 0 0 1 10639 19.51 0 0 1 0 0 0 0 0 0 0 0 0 0
12671 0 0 0 0 1 12395 18.97 0 0 1 0 0 0 0 0 0 0 0 0 0
In [68]:
# load the transactions_encoded df from the pickle file
transactions_encoded = pickle.load(open('pickle_files/transactions_encoded.p','rb'))

transactions_encoded.head()
Out[68]:
time offer_completed offer_received offer_viewed transaction member_id amount reward reward_id 0 1 2 3 4 5 6 7 8 9 10
12654 0 0 0 0 1 4694 0.83 0 0 0 0 0 0 1 0 0 0 0 0 0
12657 0 0 0 0 1 2971 34.56 0 0 0 0 0 0 0 0 1 0 0 0 0
12659 0 0 0 0 1 12705 13.23 0 0 1 0 0 0 0 0 0 0 0 0 0
12670 0 0 0 0 1 10639 19.51 0 0 1 0 0 0 0 0 0 0 0 0 0
12671 0 0 0 0 1 12395 18.97 0 0 1 0 0 0 0 0 0 0 0 0 0
In [69]:
# merge the profile dataset with the transactions_encoded dataset to add demographic data as features
learning_df = transactions_encoded.merge(profile_encoded,how='left',on='member_id',suffixes=('_trans','_prof'))

# preview new df
learning_df.head()
Out[69]:
time offer_completed offer_received offer_viewed transaction member_id amount reward reward_id 0 ... 8 9 10 age income enrollment_date enrollment_tstamp gender_F gender_M gender_O
0 0 0 0 0 1 4694 0.83 0 0 0 ... 0 0 0 20.0 30000.0 2016-07-11 1.468213e+09 1 0 0
1 0 0 0 0 1 2971 34.56 0 0 0 ... 0 0 0 42.0 96000.0 2016-01-17 1.453010e+09 0 1 0
2 0 0 0 0 1 12705 13.23 0 0 1 ... 0 0 0 36.0 56000.0 2017-12-28 1.514441e+09 0 1 0
3 0 0 0 0 1 10639 19.51 0 0 1 ... 0 0 0 55.0 94000.0 2017-10-16 1.508130e+09 1 0 0
4 0 0 0 0 1 12395 18.97 0 0 1 ... 0 0 0 39.0 67000.0 2017-12-17 1.513490e+09 1 0 0

5 rows × 27 columns

In [71]:
# drop columns that are unnecessary for machine learning
learning_df.drop(columns=['time','offer_completed','offer_received','offer_viewed','transaction','member_id',\
                          'reward','reward_id','enrollment_date'],inplace=True)
# fill missing values ('age','income') with the mean
learning_df = learning_df.fillna(learning_df.mean())

# save the df to a pickle file
pickle.dump(learning_df,open('pickle_files/learning_df.p','wb'))

# preview df
learning_df.head()
Out[71]:
amount 0 1 2 3 4 5 6 7 8 9 10 age income enrollment_tstamp gender_F gender_M gender_O
0 0.83 0 0 0 0 1 0 0 0 0 0 0 20.0 30000.0 1.468213e+09 1 0 0
1 34.56 0 0 0 0 0 0 1 0 0 0 0 42.0 96000.0 1.453010e+09 0 1 0
2 13.23 1 0 0 0 0 0 0 0 0 0 0 36.0 56000.0 1.514441e+09 0 1 0
3 19.51 1 0 0 0 0 0 0 0 0 0 0 55.0 94000.0 1.508130e+09 1 0 0
4 18.97 1 0 0 0 0 0 0 0 0 0 0 39.0 67000.0 1.513490e+09 1 0 0

Modeling

Model Selection

Now that I have cleaned and extracted features, it is time to use those features to model transaction behavior. The goal of the model is to predict the amount a member will spend in a transaction based on their demographic information (age, gender, income, length of membership) and the reward that is active (if any). Since the output variable ('amount') is continuous, I will use regression models and score them using mean squared error. Before training the models, I will have to split the dataset into training and testing sets and normalize the age, income, and enrollment_tstamp features to match the scale of the categorical features (0 to 1).

In [72]:
# load the cleaned and transformed dataframe to be used for modeling
learning_df = pickle.load(open('pickle_files/learning_df.p','rb'))
In [73]:
def create_train_test_split(df=learning_df,test_size=0.3):
    '''
    Splits a df into input (X) and output (y) arrays, then splits the arrays into training and testing sets
    Note: since the transaction data is time-based, I will not shuffle the dataset and instead use the older
    transaction as training data and the more recent transactions as the test data.
    
    Inputs
    df (pandas dataframe): the learning_df with 'amount' column as output feature
    test_size (float): size of the test dataset
    '''
    # Split the df into input (X) and output (y) arrays
    X = df.drop(columns='amount')
    y = df['amount']
    # use sklearn's train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size,shuffle=False)
    return X_train, X_test, y_train, y_test
In [74]:
# run create_train_test_split on the dataset
X_train, X_test, y_train, y_test = create_train_test_split()
In [75]:
scaler=MinMaxScaler() # instantiate a MinMaxScaler
X_train = scaler.fit_transform(X_train) # fit the scaler to the input training data
X_test = scaler.transform(X_test) # transform the input testing data based on the fit of the training data

I want to quickly instantiate, fit, predict, and evaluate various regression models to determine which model performs best on my dataset. To do that, I will write a couple functions to avoid repeating myself. The first function, fit_test_model will fit and predict values then compare them to the actual values for the training and testing data. The second function, plot_pred_actual will call the fit_test_model function and then use the outputs from that function to make a scatter plot of the predicted vs. actual values for the training dataset. This will allow me to quickly fit a new model and evaluate not only the score (MSE or MAE) but also visually inspect the shape of the error to better understand how a model is performing.

In [76]:
def fit_test_model(model, X_train, X_test, y_train, y_test):
    '''
    Fits a model to training data, makes predictions on training and testing data,
    calculates mean squared error and mean absolute error for training and testing data sets.
    Prints the mean squared and mean absolute errors for the datasets.
    
    Inputs:
    model: an instantiated sklearn model to fit and test
    X_train (array): normalized training dataset
    X_test (array): normalized testing dataset (have same number of columns as X_train)
    y_train (1D array): output values for the training dataset (must have same number of rows as X_train)
    y_test (1D array): output values for the testing dataset (must have same number of rows as X_test)
    
    Returns:
    model: the input model fit to the training data
    y_preds_train (1D array): the predictions for the training dataset (same shape as y_train)
    y_preds_test (1D array): the predictions for the testing dataset (same shape as y_test)
    '''
    model.fit(X_train, y_train)
    y_preds_train = model.predict(X_train)
    y_preds_test = model.predict(X_test)
    print('Training MSE:',mean_squared_error(y_train,y_preds_train))
    print('Testing MSE:',mean_squared_error(y_test,y_preds_test))
    print('Training MAE:',mean_absolute_error(y_train,y_preds_train))
    print('Testing MAE:',mean_absolute_error(y_test,y_preds_test))
    return model, y_preds_train, y_preds_test
In [77]:
def plot_pred_actual(model, X_train, X_test, y_train, y_test):
    '''
    Calls the fit_test_model function then displays a scatter plot of the predicted vs actual values for the training dataset
    
    Inputs:
    model: an instantiated sklearn model to fit and test
    X_train (array): normalized training dataset
    X_test (array): normalized testing dataset (have same number of columns as X_train)
    y_train (1D array): output values for the training dataset (must have same number of rows as X_train)
    y_test (1D array): output values for the testing dataset (must have same number of rows as X_test)
    
    Returns:
    None
    '''
    
    model, y_preds_train, y_preds_test = fit_test_model(model, X_train, X_test, y_train, y_test)
    fig, ax = plt.subplots(figsize=(16,9))
    plt.scatter(y_train,y_preds_train)
    plt.xlabel('Actual Transaction Amounts',fontsize=12)
    plt.ylabel('Predicted Transaction Amounts',fontsize=12)
    plt.title('Predicted vs Actual Transaction Amounts for Training Data',fontsize=16)
    plt.show()
In [78]:
# call the plot_pred_actual function on a default linear regression model
plot_pred_actual(LinearRegression(),X_train, X_test, y_train, y_test)
Training MSE: 731.7891384061396
Testing MSE: 1122.1209392535106
Training MAE: 6.480497132449754
Testing MAE: 7.0927412009315045

There are quite a few transaction amounts in the \$200 - \\$1,000 range, which seem like extreme values compared to the rest of the dataset. I need to investigate how these values affect the overall dataset before handling them in some way. Intuitively, I don't believe there are enough input features to be able to predict transactions ranging from \$0 - \\$1,000.

In [79]:
# create a boxplot to see if these values are considered outliers
fig, ax = plt.subplots(figsize=(16,6))
plt.boxplot(learning_df['amount'],vert=False)
plt.title('Boxplot of Transaction Amounts',fontsize=16)
plt.show()
In [80]:
# check the descriptive statistics for the 'amount' feature as well
learning_df['amount'].describe()
Out[80]:
count    138953.000000
mean         12.777356
std          30.250529
min           0.050000
25%           2.780000
50%           8.890000
75%          18.070000
max        1062.280000
Name: amount, dtype: float64

Despite the presence of transactions up to \$1,062, the descriptive statistics do not appear to be heavily influenced by these outliers. The mean (\\$12.78) transaction amount is slightly higher than the median (\$8.89) transaction amount, indicating the data is (obviously) skewed right, but it is still closer to the median than to the 75th percentile. After researching how to handle outliers in output variables, I found a very useful Medium article discussing several methods to handle outliers. Those methods include: 1) using tree based models, 2) performing a log-scale transformation, 3) dropping values based on the IQR method, and 4) using mean absolute error (MAE) instead of mean squared error (MSE) since MSE gives more weight to larger errors. I will evaluate all of these methods.

In [81]:
# call the plot_pred_actual function on a random forest regression model with 100 estimators
plot_pred_actual(RandomForestRegressor(n_estimators=100),X_train, X_test, y_train, y_test)
Training MSE: 402.5715997817664
Testing MSE: 1313.3455414191562
Training MAE: 4.28220163958834
Testing MAE: 7.422871649273997
In [82]:
# call the plot_pred_actual function on a bagging regression model with 100 estimators
plot_pred_actual(BaggingRegressor(n_estimators=100),X_train, X_test, y_train, y_test)
Training MSE: 404.077316401019
Testing MSE: 1309.9490716830014
Training MAE: 4.281365084919623
Testing MAE: 7.386929446819048
In [83]:
# test a random forest regression model on log-transformed output with 100 estimators
plot_pred_actual(RandomForestRegressor(n_estimators=100),X_train, X_test, np.log(y_train), np.log(y_test))
Training MSE: 0.3612110988220569
Testing MSE: 0.7815899243189056
Training MAE: 0.4059683405113859
Testing MAE: 0.6109854763522756
In [84]:
# test a random forest regression model on log-transformed output with 100 estimators
plot_pred_actual(BaggingRegressor(n_estimators=100),X_train, X_test, np.log(y_train), np.log(y_test))
Training MSE: 0.36120317116292877
Testing MSE: 0.7787317905774245
Training MAE: 0.4060234521327019
Testing MAE: 0.6095692056472234
In [85]:
# use the IQR method identify the upper limit above which values are considered outliers (Q3 + 1.5*IQR)
Q3 = np.percentile(learning_df['amount'],75)
Q1 = np.percentile(learning_df['amount'],25)
IQR = Q3-Q1
upper_limit = Q3 + 1.5*IQR
print('The upper limit using the IQR method is {:.2f}'.format(upper_limit))
The upper limit using the IQR method is 41.00
In [86]:
# filter learning_df based on the upper limit
learning_df_IQR = learning_df.loc[learning_df['amount']<=upper_limit,:]
rows_dropped = learning_df.shape[0]-learning_df_IQR.shape[0]
print('{} out of {} rows ({:.2%}) were removed using the IQR method'\
      .format(rows_dropped,learning_df.shape[0],rows_dropped/learning_df.shape[0]))
1236 out of 138953 rows (0.89%) were removed using the IQR method

There are 1,236 transactions out of the 138,953 in the transactions dataset that are greater than \$41.00 and therefore considered outliers based on the IQR method. This is an insignifcant (less than 1\%) number of values in the data set that have a large impact on the ability to predict the transaction amounts with the available features. The tree based ensemble methods (Random Forest and Bagging) struggled to handle these outliers and the scatter plots showed the models ended up predicting transaction amounts on the order of hundreds of dollars for some transactions that were actually less than \\$100 and vice versa - predicting on the order of low hundreds when the actual values were around \$1,000. Performing a log transformation on the output variable appeared to have better results, with the scatter appearing closer to linear between predicted and actual values. However, the outliers are still visible and following a different trend than the rest of the data set.

Considering the small number of outliers and the large effect they are having on the predictive ability of the models, I am comfortable simply dropping them from the dataset. The goal of the project is to make recommendations on which reward to offer to a given member based on their demographic data in order to maximize the amount of money the spend at Starbucks. The largest reward is a BOGO for \$10. If a member is making a purchase on the order of hundreds of dollars, I don't think a \\$10 reward is going to affect whether they make that purchase or not. I would rather focus on the smaller transaction amounts (less than \$41) where a \\$5 or \$10 reward could significantly impact their purchasing behavior. I will try dropping the outliers and fitting the models to the slightly reduced dataset.

In [87]:
# recreate the training and testing datasets without the outliers 
X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR = create_train_test_split(learning_df_IQR)
scaler_IQR = MinMaxScaler()
X_train_IQR = scaler_IQR.fit_transform(X_train_IQR)
X_test_IQR = scaler_IQR.transform(X_test_IQR)
In [469]:
# save MinMaxScaler in pickle file to be used in web app
pickle.dump(scaler_IQR,open('pickle_files/scaler.p','wb'))
In [88]:
# load MinMaxScaler from pickle file
scaler_IQR = pickle.load(open('pickle_files/scaler.p','rb'))
In [89]:
# test a default linear regression model
plot_pred_actual(LinearRegression(), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 39.63003503687678
Testing MSE: 39.45555022224974
Training MAE: 4.879128073216707
Testing MAE: 4.865294598511641
In [90]:
# test a Random Forest regressor with 100 estimators
plot_pred_actual(RandomForestRegressor(n_estimators=100), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 12.446902676635341
Testing MSE: 27.408655988579863
Training MAE: 2.5260685799658713
Testing MAE: 3.881212128628698
In [91]:
# increase the number of features using PolynomialFeatures
# then test a linear regression model on the increased number of features
poly_model2 = Pipeline([('poly', PolynomialFeatures(degree=2)),
                       ('linear', LinearRegression(fit_intercept=False))])
plot_pred_actual(poly_model2, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 37.53793711749128
Testing MSE: 37.34712108187887
Training MAE: 4.724236947940327
Testing MAE: 4.714485132758245
In [92]:
# try creating 3rd degree polynomial features with a linear model
poly_model3 = Pipeline([('poly', PolynomialFeatures(degree=3)),
                       ('linear', LinearRegression(fit_intercept=False))])
plot_pred_actual(poly_model3, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 35.63556448171289
Testing MSE: 9.02664169643495e+19
Training MAE: 4.5740292860263345
Testing MAE: 357658271.5195926
In [93]:
# try 2nd degree polynomial features with a random forest regressor
poly_model2 = Pipeline([('poly', PolynomialFeatures(degree=2)),
                       ('forest', RandomForestRegressor(n_estimators=100))])
plot_pred_actual(poly_model2, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 12.51547418102622
Testing MSE: 28.704860742236775
Training MAE: 2.5393907675837917
Testing MAE: 3.993734901508103
In [94]:
# try a bagging regressor with 10 estimators
plot_pred_actual(BaggingRegressor(n_estimators=10), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 13.190449590444976
Testing MSE: 28.869762898723938
Training MAE: 2.579328263942847
Testing MAE: 3.96461412836878
In [95]:
# try weighing the outputs of a linear, random forest, and bagging model
reg1=LinearRegression()
reg2=RandomForestRegressor(n_estimators=100)
reg3=BaggingRegressor(n_estimators=100,max_samples=0.4)
voter = VotingRegressor([('lr',reg1),('rf',reg2),('bg',reg3)],[.2,.4,.4])
plot_pred_actual(voter, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 15.59816561486974
Testing MSE: 26.432518958594503
Training MAE: 2.962466572937002
Testing MAE: 3.8788401623695403
In [96]:
# try a linear Support Vector regression model
plot_pred_actual(LinearSVR(), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 39.82673143538577
Testing MSE: 39.64223084979135
Training MAE: 4.863377277259713
Testing MAE: 4.847186592888953
In [97]:
# try a default Stochastic Gradient Descent regressor
plot_pred_actual(SGDRegressor(), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 39.63839886400559
Testing MSE: 39.475246239412726
Training MAE: 4.876583859985896
Testing MAE: 4.862517904067367

The default tree based (Bagging and Random Forest) regression models appear to have the best mean absolute error for the training (less than 3) and testing (less than 4) data sets of the various ensemble methods that I have tested. All of the linear models (Linear Regression, Linear Regression with Polynomial Features, Linear SVR, and SGD) all had mean absolute errors greater than 4 for both the training and testing data sets. Furthermore, the scatter plots of predicted vs. actual values for the training data of the tree based models show relationships that are closest to linear. They tend to underpredict the larger amounts (>\$30) but there is a much tighter clustering for the low amounts. Again, I find the ability to predict the smaller transaction amounts to be much more important than the larger amounts since a reward could have more effect on purchasing decisions for smaller transactions. After reading some posts about the difference between Bagging and Random Forests, I have learned that Bagging uses all of the features to determine the most effective split in the data, whereas Random Forests use a random subset of the features. There are only 17 features on which to train a model, and 14 of those are mutually exclusive (11 reward offers, including the 'no reward' option, and the 3 gender options) meaning that there are effectively only 5 features from which to make predictions for any given data point. With such a small number of features, I do not want to limit how many are used to train any one estimator, so I will choose to focus on tuning a Bagging regressor rather than a Random Forest regressor for my final model.

Model Refinement

There are two main hyperparameters for Bagging regression: number of estimators (n_estimators) and max_samples. I will perform a grid search on these two hyperparameters using mean absolute error as the scoring metric to tune my model.

In [179]:
# Run a grid search on Bagging Regressor hyperparameters
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
bg = BaggingRegressor()
parameters = {'n_estimators':[10,100,200],'max_samples':[0.25,0.5,0.75,1.0]}
model = GridSearchCV(bg,parameters,scoring=scorer,cv=5)
model.fit(X_train_IQR, y_train_IQR)
Out[179]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=BaggingRegressor(base_estimator=None, bootstrap=True,
                                        bootstrap_features=False,
                                        max_features=1.0, max_samples=1.0,
                                        n_estimators=10, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'max_samples': [0.25, 0.5, 0.75, 1.0],
                         'n_estimators': [10, 100, 200]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=make_scorer(mean_absolute_error, greater_is_better=False),
             verbose=0)
In [180]:
# view the parameters of the best estimator
model.best_estimator_
Out[180]:
BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=0.5, n_estimators=200,
                 n_jobs=None, oob_score=False, random_state=None, verbose=0,
                 warm_start=False)
In [98]:
# run the best_estimator model
plot_pred_actual(BaggingRegressor(n_estimators=200,max_samples=0.5), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 14.546004298038026
Testing MSE: 26.35518097956688
Training MAE: 2.8095859035814206
Testing MAE: 3.8361389339239076
In [99]:
# store the fitted model with the best parameters (200 estimators, 0.5 max_samples)
model, y_preds_train, y_preds_test = fit_test_model(BaggingRegressor(n_estimators=200,max_samples=.5), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
Training MSE: 14.548376639996283
Testing MSE: 26.3410070610579
Training MAE: 2.8101417485829163
Testing MAE: 3.8364000953814394
In [100]:
# view the scatter in the predicted vs actual values for the test set
fig, ax = plt.subplots(figsize=(16,9))
plt.scatter(y_test_IQR,y_preds_test)
plt.xlabel('Actual Transaction Amounts',fontsize=12)
plt.ylabel('Predicted Transaction Amounts',fontsize=12)
plt.title('Predicted vs Actual Transaction Amounts for Testing Data',fontsize=16)
plt.show()

Tuning the Bagging regressor with Grid Search resulted in a slightly worse mean absolute error for the training data set (2.81 for the tuned model compared to 2.58 for the default model) but a slightly better score for the testing data set (3.83 vs 3.97). The default model uses 10 estimators with a max_samples value of 1.0, meaning that all of the samples could be drawn for a given estimator. The best estimator, meanwhile, had 200 estimators with a max_sample value of 0.5. The fact that the testing scored improved despite the training score getting worse means that the original model was probably slightly overfit on the training data. Increasing the number of estimators while decreasing the maximum number of samples that could be drawn for each estimator helped to reduce the variance of the model and therefore improve the model's predictive ability.

Results

Now that I have a model, I need to be able to make predictions for an individual user based on which reward is offered (if any). To do this I will write a function that takes a fitted model and the user demographic data and loops through all of the reward options, makes and stores the prediction for each option so that I can identify which reward predicts the largest transaction amount for that user.

In [102]:
# grab a single user to test
test_user=X_train_IQR[0]
test_user
Out[102]:
array([0.        , 0.        , 0.        , 0.        , 1.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.02409639, 0.        , 0.59133297, 1.        ,
       0.        , 0.        ])
In [103]:
# check the order of the columns in the training dataset
learning_df.columns
Out[103]:
Index([           'amount',                   0,                   1,
                         2,                   3,                   4,
                         5,                   6,                   7,
                         8,                   9,                  10,
                     'age',            'income', 'enrollment_tstamp',
                'gender_F',          'gender_M',          'gender_O'],
      dtype='object')
In [104]:
# the first column is the output variable, 'amount', followed by the reward_id columns 0-10.
# that means in the training dataset, the reward_ids are the first 11 columns
# set all the reward columns for the test user to 0
test_user[:11] = 0
test_user
Out[104]:
array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.02409639, 0.        , 0.59133297, 1.        ,
       0.        , 0.        ])
In [105]:
# set the first column (reward_id=0) to 1
test_user[0] = 1
# make a prediction on the transaction amount for the test user with no reward offered
model.predict(test_user.reshape(1,len(test_user)))
Out[105]:
array([3.3588006])
In [106]:
# make a prediction with only 1 reward active at a time
test_user_preds = np.array([]) # initialize an empty array
test_user[:11] = 0 # set the reward booleans to 0
for idx in range(11): # loop through each reward id
    test_user[idx]=1 # set the reward id to 1 (active)
    test_user_preds = np.append(test_user_preds,model.predict(test_user.reshape(1,17))) # make prediction
    test_user[idx]=0 # reset reward id to 0 (inactive) for next iteration
In [107]:
# extract the reward values from the portfolio_encoded df
reward = portfolio_encoded['reward'].values
reward = np.insert(reward,0,0) # insert a reward of 0 for reward id 0 (no reward)
reward # preview
Out[107]:
array([ 0,  5,  5, 10, 10,  3,  2,  2,  5,  0,  0])
In [108]:
# extract the difficulty values from the portfolio_encoded df
difficulty = portfolio_encoded['difficulty'].values
difficulty = np.insert(difficulty, 0, 0) # insert a difficulty of 0 for reward id 0 (no reward)
difficulty # preview
Out[108]:
array([ 0,  5,  5, 10, 10,  7, 10, 10, 20,  0,  0])
In [109]:
# compare predicted transaction amounts to difficulties to see if reward was met
reward_met = np.greater_equal(test_user_preds,difficulty)
reward_met
Out[109]:
array([ True, False,  True, False, False, False, False, False, False,
        True,  True])
In [110]:
# subtract the reward value from the transaction amounts of the reward types that were met
reward_adjusted_preds = np.subtract(test_user_preds, reward,out=test_user_preds,where=reward_met)
reward_adjusted_preds # preview
Out[110]:
array([3.3588006 , 4.43567198, 1.17154893, 3.13903732, 3.37214159,
       3.68502369, 2.66080423, 3.9578203 , 2.75774726, 5.01527083,
       3.0825431 ])
In [112]:
# plot the adjusted predicted transactions for each reward
fig, ax = plt.subplots(figsize=(16,9))
plt.bar(range(11),reward_adjusted_preds)
plt.xlabel('Reward ID',fontsize=12)
plt.ylabel('Predicted transaction amount (USD)',fontsize=12)
plt.title('Predicted transaction amount vs Reward ID',fontsize=16)
plt.show()
In [116]:
def make_member_predictions(model, member_inputs):
    '''
    Makes predictions for the amount a member will spend based on each of the 10 reward offers or no reward offer
    Deducts the reward amount if the predicted amount is greater than the difficulty to complete the reward
    
    Inputs:
    model: an sklearn model fit to training data
    member_inputs (2D array): normalized values for the input features with shape (# members, 17 features)
        the features (columns) must be in this order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 'age','income',
                                                     'enrollment_tstamp','gender_F','gender_M','gender_O'
    
    Returns:
    reward_adjusted_preds (2D array): the reward_adjusted predicted transaction amounts for the 11 reward options (no reward + 10 reward types)
    - shape (# members, 11 columns)
    '''
    inputs = np.copy(member_inputs) # create a copy to avoid changing the original inputs
    member_preds = np.zeros((inputs.shape[0],11)) # initialize array to store member predictions
    for member_num in range(inputs.shape[0]): # iterate through the inputs for each member
        inputs[member_num,:11] = 0 # reset the reward booleans to 0 so that only 1 reward is active at a time
        for idx in range(11): # loop through each reward_id (0 through 10)
            inputs[member_num,idx]=1 # set the reward_id boolean to 1 (True)
            # append the prediction to the member_preds array
            member_preds[member_num,idx] = model.predict(inputs[member_num,:].reshape(1,17))
            inputs[member_num,idx]=0 # set the reward_id back to 0
    
    reward_met = np.greater_equal(member_preds,difficulty)
    reward_adjusted_preds = np.subtract(member_preds, reward,out=member_preds,where=reward_met)
    return reward_adjusted_preds
In [117]:
member_preds = make_member_predictions(model,X_train_IQR[:2])
member_preds
Out[117]:
array([[ 3.3588006 ,  4.43567198,  1.17154893,  3.13903732,  3.37214159,
         3.68502369,  2.66080423,  3.9578203 ,  2.75774726,  5.01527083,
         3.0825431 ],
       [27.9615573 , 20.38349871, 21.74482026, 17.55006657, 16.25764252,
        21.16777615, 26.62271435, 24.96608192, 19.69099794, 26.72603276,
        27.9520149 ]])
In [118]:
def transform_demographic_data(age,income,enrollment_date,gender):
    '''
    Take raw demographic data (age, income, enrollment date, gender) and transform it:
    - convert date to timestamp, encode gender, scale all values
    - add 11 zeros in front of data to create input array for `make_member_predictions`
    
    Inputs:
    age (int or float): member age in years or NaN if age is unknown (if unknown will use mean age)
    income (float): member annual income in USD or NaN if unknown (if unknown will use mean income)
    enrollment_date (datetime object): the date (mm-dd-YYYY) that a member enrolled
    gender (string): 'F','M','O', or 'U' (for unknown)
    
    Returns:
    member_inputs (1x17 array): normalized array containing values for:
    reward_ids (0-10), age, income, enrollment date, and genders (F,M,O) 
    '''
    mean_age = learning_df['age'].mean() # calculate the mean age from the dataset (used to replace nulls)
    mean_income = learning_df['income'].mean() # calculate mean income from the dataset (used to replace nulls)
    if pd.isnull(age):
        age = mean_age # if age is null, replace with mean age
    if pd.isnull(income):
        income=mean_income # if income is null, replace with mean income
    tstamp = enrollment_date.timestamp() # convert enrollment date from datetime object to timestamp (for scaling)
    # create the gender encoding array based on the gender provided (if any)
    if gender == 'F': 
        gen_array = np.array([1,0,0])
    elif gender == 'M':
        gen_array = np.array([0,1,0])
    elif gender == 'O':
        gen_array = np.array([0,0,1])
    else:
        gen_array = np.array([0,0,0])
    member_inputs_raw = np.zeros((1,17)) # create 1x17 array of zeros to initialize the member inputs
    member_inputs_raw[0,11:14] = [age, income, tstamp] # replace elements 11-13 with the age, income and tstamp
    member_inputs_raw[0,14:] = gen_array # replace the last 3 elements with the gender encoding array
    member_inputs = scaler_IQR.transform(member_inputs_raw) # scale the inputs using the MinMaxScaler
    return member_inputs.reshape(1,17) # return the inputs in a 1x17 array
In [119]:
# test transform_demographic_data
member_inputs = transform_demographic_data(27,75000,datetime.strptime('2014-06-15','%Y-%m-%d'),'M')
member_inputs
Out[119]:
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.10843373, 0.5       , 0.17608338, 0.        ,
        1.        , 0.        ]])
In [120]:
# test that transform_demographic_data can handle NaN's
member_inputs = transform_demographic_data(np.nan,np.nan,datetime.strptime('2014-06-15','%Y-%m-%d'),'M')
member_inputs
Out[120]:
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.41824055, 0.35381678, 0.17608338, 0.        ,
        1.        , 0.        ]])
In [121]:
# test that make_member_predictions can handle output of transform_demographic_data
member_preds = make_member_predictions(model,member_inputs)
member_preds
Out[121]:
array([[0.55988238, 2.33108821, 1.2849325 , 1.8552575 , 1.9426375 ,
        2.20215   , 2.23422833, 2.93338833, 1.41367655, 1.20213821,
        1.37729286]])
In [122]:
def plot_reward_preds(preds):
    '''
    Creates a bar chart of predicted transaction amounts for all 11 reward options
    
    Input:
    preds (1x11 array): the transaction amounts for the reward options, typically the output of make_member_predictions
    
    Returns:
    None
    '''
    fig, ax = plt.subplots(figsize=(16,9)) # create a large figure
    plt.bar(range(11),preds.flatten()) # add a bar plot
    ax.set_xticks(range(11))
    plt.xlabel('Reward ID',fontsize=12)
    plt.ylabel('Prediction transaction amount (USD)',fontsize=12)
    plt.title('Predicted transaction amounts vs Reward ID',fontsize=16)
    plt.show()
In [123]:
# test the plot_reward_preds function
plot_reward_preds(member_preds)
In [125]:
# create a new function to identify the best reward ID
def best_reward(preds, show_plot=True):
    '''
    Determines which reward ID results in the maximum transaction amount
    Plots the predicted transaction amounts with the maximum amount in red if 'show_true' is set to True
    
    Inputs:
    preds (1x11 array): the transaction amounts for the reward options, typically the output of make_member_predictions
    show_plot (boolean): if True, plots the predicted transaction amounts for each reward option with the best reward in red
    
    Returns:
    best_reward (int): the reward id corresponding to the maximum predicted transaction amount
    '''
    best_reward = np.argmax(preds.flatten()) # identify the index (aka reward_id) corresponding to the max trans. amt.
    if show_plot: # make a bar plot if show_plot is True
        fig, ax = plt.subplots(figsize=(16,9))
        barlist = plt.bar(range(11),preds.flatten())
        barlist[best_reward].set_color('red')
        ax.set_xticks(range(11))
        plt.xlabel('Reward ID',fontsize=12)
        plt.ylabel('Prediction transaction amount (USD)',fontsize=12)
        plt.title('Predicted transaction amounts vs Reward ID',fontsize=16)
        plt.show()
    return best_reward
In [126]:
# test the best_reward function
best_reward_id = best_reward(member_preds)
print('The best reward is {}.'.format(best_reward_id))
The best reward is 7.

Now that I have a tuned model and a function to identify the best reward option for an individual, I want to see if there are any patterns in the demographic data, i.e:

  • Are different rewards recommended to different groups of people?

To do that, I will grab the age, income, enrollment date, and gender for each member in the profile data set and use the transform_demographic_data, make_member_predictions, and best_reward functions to assign a recommended reward. After that, I can group the members by different demographics and see which reward is recommended for each group.

In [128]:
# create a dataframe with only the demographic data from the profile df
member_preds_df = profile[['age','income','became_member_on','gender']]
# convert the 'became_member_on' column to a datetime object and store in 'enrollment_date'
member_preds_df['enrollment_date'] = member_preds_df['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d'))
# drop the 'became_member_on' column
member_preds_df = member_preds_df.drop(columns='became_member_on')
# preview df
member_preds_df.head()
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
Out[128]:
age income gender enrollment_date
0 NaN NaN NaN 2017-02-12
1 55.0 112000.0 F 2017-07-15
2 NaN NaN NaN 2018-07-12
3 75.0 100000.0 F 2017-05-09
4 NaN NaN NaN 2017-08-04
In [309]:
# # create a new column with the best reward for each member (takes a while)
# member_preds_df['best_reward'] = member_preds_df.apply(lambda x: best_reward(make_member_predictions(model,transform_demographic_data(x['age'],x['income'],x['enrollment_date'],x['gender'])),show_plot=False),axis=1)
# # preview df
# member_preds_df.head()
Out[309]:
age income gender enrollment_date best_reward
0 NaN NaN NaN 2017-02-12 10
1 55.0 112000.0 F 2017-07-15 10
2 NaN NaN NaN 2018-07-12 4
3 75.0 100000.0 F 2017-05-09 9
4 NaN NaN NaN 2017-08-04 6
In [310]:
# # save the member_preds_df in a pickle file to skip the previous step in future runs
# pickle.dump(member_preds_df,open('pickle_files/member_preds_df.p','wb'))
In [129]:
# load pickle file
member_preds_df = pickle.load(open('pickle_files/member_preds_df.p','rb'))
# preview df
member_preds_df.head()
Out[129]:
age income gender enrollment_date best_reward
0 NaN NaN NaN 2017-02-12 10
1 55.0 112000.0 F 2017-07-15 10
2 NaN NaN NaN 2018-07-12 4
3 75.0 100000.0 F 2017-05-09 9
4 NaN NaN NaN 2017-08-04 6
In [130]:
def plot_reward_freqs(counts, demographic):
    '''
    Determines which reward ID is recommended the most
    Plots the recommendation frequencies with the maximum frequency in red
    
    Inputs:
    counts (1x11 array): the counts for the reward options
    demographic (string): the demographic that the counts are grouped by; will be displayed in title
    
    Returns:
    None
    '''
    best_reward = np.argmax(counts.flatten()) # identify the index (aka reward_id) of the most frequent reward
    fig, ax = plt.subplots(figsize=(16,9)) # create a large figure
    barlist = plt.bar(range(11),counts.flatten()/sum(counts)) # add a bar plot of frequencies
    barlist[best_reward].set_color('red') # highlight the most frequent reward in red
    ax.set_xticks(range(11))
    plt.xlabel('Reward ID',fontsize=12)
    plt.ylabel('Recommendation Frequency',fontsize=12)
    plt.title('Distribution of Reward Recommendations for {}'.format(demographic),fontsize=16)
    plt.show()  
In [131]:
# make a bar chart of the best reward counts by reward id
plot_reward_freqs(member_preds_df['best_reward'].value_counts(sort=False).values,'All Members')
In [132]:
def plot_mult_reward_freqs(counts, demographic, legend):
    '''
    Determines which reward ID is recommended the most for two demographic groups
    Plots the recommendation frequencies with the maximum frequency in red/orange, respectively
    
    Inputs:
    counts (2x11 array): the counts for the reward options
    demographic (string): the demographic that the counts are grouped by; will be displayed in title
    legend (list of strings): 2 element list corresponding to the demographic distinction between the two groups, will be displayed in the legend
    
    Returns:
    None
    '''
    best_reward1 = np.argmax(counts[0].flatten()) # identify the index (aka reward_id) of the most frequent reward in demo group 1
    best_reward2 = np.argmax(counts[1].flatten()) # identify the index (aka reward_id) of the most frequent reward in demo group 2
    fig, ax = plt.subplots(figsize=(16,9)) # make a large figure
    width = 0.4 # set the bar width
    # plot the first demo group in blue
    barlist1 = plt.bar(np.arange(11)-width/2,counts[0].flatten()/sum(counts[0]),width=width,color='blue') 
    barlist1[best_reward1].set_color('red') # highlight the most frequent reward of demo group 1 in red
    # plot the second demo group in green
    barlist2 = plt.bar(np.arange(11)+width/2,counts[1].flatten()/sum(counts[1]),width=width,color='green')
    barlist2[best_reward2].set_color('orange') # highlight the most frequent reward of demo group 2 in orange
    ax.set_xticks(range(11)) # label the reward ids
    plt.xlabel('Reward ID',fontsize=12)
    plt.ylabel('Recommendation Frequency',fontsize=12)
    plt.title('Distribution of Reward Recommendations by {}'.format(demographic),fontsize=16)
    # create a custom legend with blue/red rectangles for demo group 1 and green/orange rectangles for demo group 2
    blue_patch = mpatches.Patch(color='blue')
    red_patch = mpatches.Patch(color='red')
    green_patch = mpatches.Patch(color='green')
    orange_patch = mpatches.Patch(color='orange')
    plt.legend([(blue_patch,red_patch),(green_patch,orange_patch)],[legend[0],legend[1]],numpoints=1,handler_map={tuple:HandlerTuple(ndivide=None,pad=0)})
    plt.show()
    
In [133]:
# split the member_preds_df into younger/older than mean age
mean_age = member_preds_df['age'].mean()
younger = member_preds_df.loc[member_preds_df['age']<=mean_age,'best_reward'].value_counts().sort_index().values
older = member_preds_df.loc[member_preds_df['age']>mean_age,'best_reward'].value_counts().sort_index().values
age_demo = 'Age (Mean is {:2.2f} years)'.format(mean_age)
# plot the reward frequencies by age
plot_mult_reward_freqs([younger,older],age_demo,['Younger than mean age','Older than mean age'])
In [134]:
# split the member_preds_df into less/greater than mean income
mean_income = member_preds_df['income'].mean()
poorer = member_preds_df.loc[member_preds_df['income']<=mean_income,'best_reward'].value_counts().sort_index().values
richer = member_preds_df.loc[member_preds_df['income']>mean_income,'best_reward'].value_counts().sort_index().values
income_demo = 'Income (Mean is ${:.0f})'.format(mean_income)
# plot the reward frequencies by income
plot_mult_reward_freqs([poorer,richer],income_demo,['Below average income','Above average income'])
In [135]:
# split the member_preds_df into before/after 2017 enrollment
begin2017 = datetime(2017,1,1)
before2017 = member_preds_df.loc[member_preds_df['enrollment_date']<begin2017,'best_reward'].value_counts().sort_index().values
after2017 = member_preds_df.loc[member_preds_df['enrollment_date']>=begin2017,'best_reward'].value_counts().sort_index().values

# plot the reward frequencies by enrollment date
plot_mult_reward_freqs([before2017,after2017],'Enrollment Date',['Before 2017','2017 or after'])
In [136]:
# split the member_preds_df into Female/Male
female = member_preds_df.loc[member_preds_df['gender']=='F','best_reward'].value_counts().sort_index().values
male = member_preds_df.loc[member_preds_df['gender']=='M','best_reward'].value_counts().sort_index().values

# plot the reward frequencies by gender (omitting 'Other')
plot_mult_reward_freqs([female,male],'Gender',['Female','Male'])

Discussion

Looking at all of the members in the profile dataset, the most commonly recommended reward type was Reward 0, or no reward. That was followed closely by Reward 8 (10-day 25\% discount if you spend \$20), Reward 10 (4-day informational) and Reward 9 (3-day informational). After that, there is a large drop in recommendation frequency for the remaining 7 reward types.

To get a better understanding of how the different demographic features impacted reward recommendations, I also plotted bar charts splitting each demographic into two groups. The first group of bars is plotted in blue with the most common reward recommendation plotted in red, while the second group was plotted in green with the most common reward recommendation plotted in orange. This helped me to quickly understand the differences in the distribution of reward recommendations for each group as well as which reward was most popular for each group.

Age
I split members based on whether they were younger or older than the mean age of 54.39 years. Younger members were recommended Reward 8 the most (>25\%) while older members were recommended Reward 0 (>25\%) the most.

Income
I split members by income in the same way as age, whether their income was less than or greater than the mean income of \$65,405. Lower income members were recommended Reward 8 (>35\%) the most and higher income members were recommended Reward 0 (~30\%) the most.

Enrollment Date
I split members by enrollment date based on whether or not they were a member before 2017. The trends in member enrollment showed increasing enrollment numbers from the start of the program in 2013 through 2017 before dropping in 2018 (likely due to an incomplete year of enrollments in the data set). Splitting the members into 2013-2016 and 2017-2018 groups appeared to roughly create two evenly sized groups. The pre-2017 members were recommended Reward 0 (~25\%) the most while members enrolling in 2017 or after were recommended Reward 8 (>25\%) the most.

Gender
Splitting members by gender was easy because it was already a categorical feature. I only included female and male members because they made up the majority of the data set. 'Other' gender types only made up 1\% of the data set so I didn't expect it to have a strong influence on reward predictions. Female members were recommended Reward 0 (~30\%) the most while male members were recommended Reward 8 (~30\%) the most.

Reward 0 and Reward 8 were the most recommended for each demographic distinction. Oddly enough, however, the same reward was not recommended for both groups within a given demographic. This tells me that each demographic did have some impact on the reward recommendations - a good sign for the usefulness of the model.

Looking back at the recommendation frequencies for the entire member data set, one trend that stands out is the fact that three of the top 4 most recommended reward options (Rewards 0, 9, and 10) do not actually offer any reward. Reward 0 represents no reward offer while 9 and 10 are simply informational. Reward 8 is the only monetary reward that is recommended to more than 10\% of the members in the data set and it happens to have the largest minimum spending amount to receive the reward. This reward offers a 25\% discount when a member spends \$20, meaning that the net amount spent by the member is \\$15. I believe that these 4 rewards are recommended the most partly because of how I defined the predicted transaction amounts and the best reward.

For a given member, the make_member_predictions function uses the tuned Bagging regression model to make predictions for the amount a member will spend in a given transaction based on the reward (if any) that is offered. The function returns reward-adjusted transaction amounts, meaning that if a predicted transaction amount exceeds the 'difficulty' (minimum amount required to receive the reward), then the reward amount is subtracted from the predicted transaction amount. I chose to use reward-adjusted transaction amounts as a way to normalize the reward types. For example, if a member is expected to spend \$4 without a reward offer but would spend \\$6 if they received a \$5 BOGO, then it would be better to not offer that member a reward because the reward amount offsets most of the amount they are spending. This leads to the goal of the reward program and how one defines the best reward. In my opinion, the goal of the reward program is to maximize the amount of money spent at Starbucks by members. I assume that reward money is equivalent to transactional money, when in reality it may be significantly less. Starbucks may only value transactional money and not care how much reward money is given out; or the unit cost of that second drink in a BOGO is actually less than the marginal increase in money spent in a single transaction with a BOGO compared to without an offer. In either of those cases, the definition of the best reward offer would have to be adjusted. Because of the assumptions I have made in calculating predicted transaction amounts, any offer where the reward is equal to the minimum spend amount (i.e. the BOGO offers) is greatly weakened by my adjustment. I believe this partly explains why three of the top 4 most recommended offers do not have any reward associated with them (and by extension do not have an adjustment) and the other popular offer in the top 4 has the largest reward-adjusted minimum spend.

Conclusion

The goal of this project was to make recommendations on which reward to offer a member based on their demographic data (age, income, enrollment date, and gender). To accomplish this goal, I was provided three data sets: information about Starbucks rewards member, information about the types of rewards that were offered, and an event log detailing when rewards were sent to members, viewed and completed by members, and the transaction amounts made by members during a 30-day test period. I cleaned these data sets by encoding categorical features and converting data types from dictionaries, lists, and strings to floats, integers, and datetime objects. I then consolidated the event log data so that I knew which reward type (if any) was active for every transaction. This step proved to be one of the most challenging parts of the project. I had to find ways to compare values from different columns across multiple rows of the event log for a subset of the data corresponding to each member. Due to the size of the dataset (17,000 unique members with over 300,000 rows) I had to be very thoughtful with how to efficiently store and recall the relevant information during the feature extraction process. Once I was done, I could finally begin to train various regression models and test their ability to predict transaction amounts based on reward type and demographic data. However, I quickly discovered that the transactional data had some very extreme outliers that made it difficult to train the models or evaluate their performance. This posed another major challenge in this project - how to handle the outliers in the output variable. After doing some research, I learned about several techniques to try to handle the outliers, ultimately deciding to drop them from the modeling dataset because they did not reflect the types of transactions that I was trying to predict. Once the outliers were removed, I continued testing various regression models on transactions between \$0-\\$41 before settling on one - tree based regression with Bagging - to tune using Grid Search. My tuned model had a mean absolute error of \$2.81 on the training set and a mean absolute error of \\$3.83 on the testing data set, numbers with which I was satisfied. I used my tuned model to make predictions for each member in the original data set and each reward type to determine which was the best reward type for each member. I then aggregated the recommended reward by each of the demographics to better understand the recommendations that my model was making. The model most often recommended either no reward or the offer of a 25\% discount on a \$20 transaction, followed by the two informational offers. I addressed the frequency of these reward recommendations in the Results discussion section.

Improvement

There are two major limitations associated with my modeling approach. First, I assume that each transaction is independent of other transactions. This means that I do not consider the possibility that multiple transactions could occur while an offer is active, which affects both the discount and informational offer types. My model predicts the amount spent in a single transaction and I use those predictions to make a reward recommendation. In reality, there may be members who make multiple transactions over a short period of time that add up to an amount that is greater than the amount predicted when no offer is given. I would expect that accounting for multiple offers would increase the frequency of recommending discount and informational offers.

The second major limitation to my model is that it is only trained on transaction amounts rather than the number of transactions. By focusing on the amount, I effectively assumed that the likelihood of a transaction occurring was constant for each reward type. A more concrete way to think about this is by imagining a customer standing at a register about to place an order. The customer has already decided to purchase something (make a transaction) and now just has to decide what they want to purchase or how much they want to spend. The type of reward offered can definitely impact their purchasing behavior in this scenario - perhaps they were only planning to buy a small coffee but because they had a discount reward decided to buy a pastry as well. However, there is also a significant effect that is not captured by this approach - the members who would not have purchased anything if they had not received an offer. In order to incorporate the probability of a transaction into the reward recommendations, I need to perform some statistical hypothesis tests to determine which reward types have a significant increase in transaction rate compared to the null hypothesis (no reward offer). I would expect this modification to increase the frequency of offers with monetary rewards (BOGO and discount types) over the informational offers or no reward offer.